Cascaded Chinese Weibo Segmentation Based on CRFs
Abstract
With the development of Web 2.0, processing data on the Internet has become necessary. This paper reports our work on Chinese Weibo segmentation in the 2012 CIPS-SIGHAN bakeoff. To improve the recognition accuracy of out-of-vocabulary words, we propose a cascaded model that first segments and disambiguates in-vocabulary words, then recovers out-of-vocabulary words from the fragments. Both stages are trained with a character-based CRFs model and a user-edited external vocabulary. The final performance on the test data shows that our system achieves a promising result.
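Character-based segmentation of the kind named in the abstract is commonly cast as per-character tagging, where a sequence model such as a CRF predicts one label per character. The sketch below shows only that encoding and decoding step. The abstract does not state which tag set the authors used, so the B/M/E/S scheme (begin, middle, end, single) and the helper names `words_to_tags` and `tags_to_words` are assumptions for illustration, not the paper's implementation.

```python
from typing import List


def words_to_tags(words: List[str]) -> List[str]:
    """Map a segmented sentence to one B/M/E/S tag per character.

    Assumed tag set: B = word begin, M = word middle,
    E = word end, S = single-character word.
    """
    tags: List[str] = []
    for word in words:
        if len(word) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return tags


def tags_to_words(chars: str, tags: List[str]) -> List[str]:
    """Rebuild words from characters and their (predicted) tags."""
    words: List[str] = []
    current = ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("E", "S"):  # a word closes on E or S
            words.append(current)
            current = ""
    if current:  # tolerate a tag sequence that ends mid-word
        words.append(current)
    return words


if __name__ == "__main__":
    gold = ["我", "喜欢", "计算机"]
    tags = words_to_tags(gold)
    print(tags)
    print(tags_to_words("".join(gold), tags))
```

In a cascaded setup like the one described, each stage would train its own tagger over sequences encoded this way; how the second stage consumes the first stage's fragments is not specified in the abstract and is left out here.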
Similar resources
Exploiting Heterogeneous Annotations for Weibo Word Segmentation and POS Tagging
This paper describes our system designed for the NLPCC 2015 shared task on Chinese word segmentation (WS) and POS tagging for Weibo Text. We treat WS and POS tagging as two separate tasks and use a cascaded approach. Our major focus is how to effectively exploit multiple heterogeneous data to boost performance of statistical models. This work considers three sets of heterogeneous data, i.e., We...
A Cascaded Linear Model for Joint Chinese Word Segmentation and Part-of-Speech Tagging
We propose a cascaded linear model for joint Chinese word segmentation and part-of-speech tagging. With a character-based perceptron as the core, combined with real-valued features such as language models, the cascaded model is able to efficiently utilize knowledge sources that are inconvenient to incorporate into the perceptron directly. Experiments show that the cascaded model achieves improved...
CRFs-Based Chinese Word Segmentation for Micro-Blog with Small-Scale Data
In this paper, we propose a Chinese word segmentation model for micro-blog text. Although Conditional Random Fields (CRFs) models have been used for word segmentation, this is the first time they have been applied to segmentation in the domain of Chinese micro-blogs. Different from the genres of common articles, micro-blog has gradually become a new literary with the development o...
A Dual-layer CRFs Based Joint Decoding Method for Cascaded Segmentation and Labeling Tasks
Many problems in NLP require solving a cascade of subtasks. Traditional pipeline approaches yield to error propagation and prohibit joint training/decoding between subtasks. Existing solutions to this problem do not guarantee non-violation of hard-constraints imposed by subtasks and thus give rise to inconsistent results, especially in cases where segmentation task precedes labeling task. We pr...
Rules-based Chinese Word Segmentation on MicroBlog for CIPS-SIGHAN on CLP2012
In this evaluation, we have taken part in the task of the Word Segmentation on Chinese MicroBlog. In this task, after analysing the feature of the MicroBlog and the result of our original Chinese word segmentation system, four Optimization Rules are proposed to optimize the segmentation algorithm for Chinese word segmentation on MicroBlog corpora. The optimized segmentation system is based on c...
Publication date: 2012